FM-KZ: An even simpler alphabet-independent FM-index

Authors

  • Rafal Przywarski
  • Szymon Grabowski
  • Gonzalo Navarro
  • Alejandro Salinger
Abstract

In an earlier work [6] we presented a simple FM-index variant, based on the idea of Huffman-compressing the text and then applying the Burrows-Wheeler transform over it. The main drawback of using Huffman was its lack of synchronizing properties, which forced us to supply a second bit stream marking the Huffman codeword boundaries. As a result, the index needed O(n(H0+1)) bits of space, but with a constant factor of 2 on the main term. Several options exist for reducing this space overhead, with varying effects on query speed. In this work we propose Kautz-Zeckendorf coding as a simple and practical replacement for Huffman. We dub the new index FM-KZ. We also present an efficient implementation of the rank operation, which is the main building block of FM-KZ. Experimental results show that our index provides an attractive space/time tradeoff in comparison with existing succinct data structures, and in the DNA test it even wins in both search time and space use. An additional asset of our solution is its relative simplicity.
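To illustrate why a Zeckendorf-style code does not need a separate boundary stream, here is a minimal sketch of the classical Fibonacci code, which is built on the Zeckendorf representation (every positive integer is a unique sum of non-consecutive Fibonacci numbers). This is not the FM-KZ implementation: the abstract does not specify the exact Kautz-Zeckendorf variant, and the function names `zeckendorf_bits`, `fibonacci_code` and `decode_stream` are hypothetical. Because a Zeckendorf representation never contains two adjacent 1s, appending a single 1 makes the pair "11" an unambiguous end-of-codeword marker.

```python
def zeckendorf_bits(n: int) -> str:
    """Zeckendorf representation of n >= 1, smallest Fibonacci term first.

    Uses the Fibonacci sequence 1, 2, 3, 5, 8, ... and a greedy choice of
    the largest term that still fits, which never selects two adjacent terms.
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    fibs = [1, 2]
    while fibs[-1] <= n:
        fibs.append(fibs[-1] + fibs[-2])
    bits = []
    remaining = n
    for f in reversed(fibs[:-1]):  # every term that is <= n, largest first
        if f <= remaining:
            bits.append("1")
            remaining -= f
        else:
            bits.append("0")
    # bits is largest-term-first; drop unused leading terms and reverse.
    return "".join(bits).lstrip("0")[::-1]


def fibonacci_code(n: int) -> str:
    """Self-delimiting codeword: Zeckendorf bits followed by a closing 1."""
    return zeckendorf_bits(n) + "1"


def decode_stream(bits: str) -> list[int]:
    """Split a concatenation of codewords at each '11' and decode the values."""
    fibs = [1, 2]
    values = []
    total, i, prev = 0, 0, "0"
    for b in bits:
        if b == "1" and prev == "1":
            # Second 1 of the '11' marker: the codeword is complete.
            values.append(total)
            total, i, prev = 0, 0, "0"
            continue
        while i >= len(fibs):
            fibs.append(fibs[-1] + fibs[-2])
        if b == "1":
            total += fibs[i]
        i += 1
        prev = b
    return values


if __name__ == "__main__":
    symbols = [1, 2, 11, 4]
    stream = "".join(fibonacci_code(v) for v in symbols)
    print(stream)
    print(decode_stream(stream))
```

In an index of this kind, more frequent symbols would presumably be assigned smaller integers and hence shorter codewords; that mapping is an assumption here and is not shown.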


Similar resources

First Huffman, Then Burrows-Wheeler: A Simple Alphabet-Independent FM-Index

Main Results. The basic string matching problem is to determine the occurrences of a short pattern P = p1 p2 ... pm in a large text T = t1 t2 ... tn, over an alphabet of size σ. Indexes are structures built on the text to speed up searches, but they used to take up much space. In recent years, succinct text indexes have appeared. A prominent example is the FM-index [2], which takes little spa...


A simple alphabet-independent FM-index

We design a succinct full-text index based on the idea of Huffman-compressing the text and then applying the Burrows-Wheeler transform over it. The resulting structure can be searched as an FM-index, with the benefit of removing the sharp dependence on the alphabet size, σ, present in that structure. On a text of length n with zero-order entropy H0, our index needs O(n(H0 + 1)) bits of space, wi...


List of Contributions: The Pre-history and Future of the Block-Sorting Compression Algorithm

The FM-index is a succinct text index needing only O(Hkn) bits of space, where n is the text size and Hk is the kth order entropy of the text. FM-index assumes constant alphabet; it uses exponential space in the alphabet size, σ. In this paper we show how the same ideas can be used to obtain an index needing O(Hkn) bits of space, with the constant factor depending only logarithmically on σ. Our...


An Efficient Composite-Alphabet Transform for String Matching under a Restricted Alphabet Set

String matching is a problem of finding all occurrences of a short pattern on a relatively long reference string. While a number of methods have been presented, most published implementations assume several restrictions due to some practical issues. We focus on the restriction of the alphabet size, which is usually set to be 256 in many string matching libraries. When strings must be handled ov...


An Alphabet-Friendly FM-Index

We show that, by combining an existing compression boosting technique with the wavelet tree data structure, we are able to design a variant of the FM-index which scales well with the size of the input alphabet Σ. The size of the new index built on a string T[1, n] is bounded by nHk(T) + O((n log log n) / log_|Σ| n) bits, where Hk(T) is the k-th order empirical entropy of T. The above bound h...




Publication date: 2006